Ordinary least squares

Estimator and diagnostics

MACS 33001 University of Chicago

\[\newcommand{\E}{\mathrm{E}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\se}{\text{se}} \newcommand{\Lagr}{\mathcal{L}} \newcommand{\lagr}{\mathcal{l}}\]

Regression

  • Regression
  • Response variable \(Y\)
  • Covariate \(X\)
  • Regression function

    \[r(x) = \E (Y | X = x) = \int y f(y|x) dy\]

  • Estimate the form

    \[(Y_1, X_1), \ldots, (Y_n, X_n) \sim F_{X,Y}\]

Simple linear regression

\[r(x) = \beta_0 + \beta_1 x\]

  • \(X_i\) is one-dimensional
  • \(r(x)\) assumed to be linear
  • \(\Var (\epsilon_i | X = x) = \sigma^2\) does not depend on \(x\)

Linear regression model

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

  • \(\E (\epsilon_i | X_i) = 0\)
  • \(\Var (\epsilon_i | X_i) = \sigma^2\)
  • Unknown \(\beta_0\), \(\beta_1\), and \(\sigma^2\)
  • \(\hat{\beta}_0\) and \(\hat{\beta}_1\)
  • Fitted line

    \[\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x\]

  • Fitted values

    \[\hat{Y}_i = \hat{r}(X_i)\]

  • Residuals

    \[\hat{\epsilon}_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)\]

  • Residual sum of squares

    \[RSS = \sum_{i=1}^n \hat{\epsilon}_i^2\]

Estimation strategy

  • Unbiased
    • \(\E(\hat{\beta}) = \beta\)
  • Efficient
    • \(\min(\Var(\hat{\beta}))\)

Least squares estimator

  • Values \(\hat{\beta}_0, \hat{\beta}_1\) that

    \[\min(RSS)\]

\[ \begin{aligned} \hat \beta_1 & = \dfrac{ \sum_{i=1}^n Y_iX_i - \bar{Y}_n \sum_{i=1}^nX_i}{ \sum_{i=1}^n X_i^2 - \bar{X}_n\sum_{i=1}^nX_i}\\ \hat \beta_0 & = \bar{Y}_n - \hat{\beta}_1 \bar{X}_n \\ \hat{\sigma}^2 &= \left( \frac{1}{n - 2} \right) \sum_{i=1}^n \hat{\epsilon}_i^2 \end{aligned} \]
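
As a sanity check on these formulas, here is a minimal R sketch (simulated x and y, all names illustrative) that computes the closed-form estimates by hand and compares them against lm():

set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100)

# closed-form estimates, following the formulas above
b1 <- (sum(y * x) - mean(y) * sum(x)) / (sum(x^2) - mean(x) * sum(x))
b0 <- mean(y) - b1 * mean(x)
sigma2 <- sum((y - (b0 + b1 * x))^2) / (length(y) - 2)

c(b0, b1)        # should match:
coef(lm(y ~ x))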


Multivariate formulation

\[\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{u}\]

  • \(\mathbf{Y}\): \(N\times 1\) vector
  • \(\mathbf{X}\): \(N \times K\) matrix
  • \(\boldsymbol{\beta}\): \(K \times 1\) vector
  • \(\mathbf{u}\): \(N\times 1\) vector
  • \(i \in \{1,\ldots,N \}\)
  • \(k \in \{1,\ldots,K \}\)

Multivariate formulation

\[\hat{\boldsymbol{\beta}} = (\mathbf{X'X})^{-1}\mathbf{X'Y}\]
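
A one-line version of this estimator in R; X is assumed to be a design matrix whose first column is all ones, and Y the response vector:

beta_hat <- solve(crossprod(X), crossprod(X, Y))  # solves (X'X) b = X'Y

crossprod(X) computes \(\mathbf{X'X}\) without forming the transpose explicitly, and the two-argument solve() avoids explicitly inverting the matrix, which is the numerically preferred route.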

Maximum likelihood estimator

  • \(\epsilon_i | X_i \sim N(0, \sigma^2)\)

    \[Y_i | X_i \sim N(\mu_i, \sigma^2)\]

    • \(\mu_i = \beta_0 + \beta_1 X_i\)
  • Likelihood function

    \[ \begin{align} \prod_{i=1}^n f(X_i, Y_i) &= \prod_{i=1}^n f_X(X_i) f_{Y | X} (Y_i | X_i) \\ &= \prod_{i=1}^n f_X(X_i) \times \prod_{i=1}^n f_{Y | X} (Y_i | X_i) \\ &= \Lagr_1 \times \Lagr_2 \end{align} \]

    \[ \begin{align} \Lagr_1 &= \prod_{i=1}^n f_X(X_i) \\ \Lagr_2 &= \prod_{i=1}^n f_{Y | X} (Y_i | X_i) \end{align} \]

  • Conditional likelihood

    \[ \begin{align} \Lagr_2 &\equiv \Lagr(\beta_0, \beta_1, \sigma^2) \\ &= \prod_{i=1}^n f_{Y | X}(Y_i | X_i) \\ &\propto \frac{1}{\sigma} \exp \left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - \mu_i)^2 \right\} \end{align} \]

  • Conditional log-likelihood

    \[\lagr(\beta_0, \beta_1, \sigma^2) = -n \log(\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^n \left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2\]

    • \(\max \left( \lagr(\beta_0, \beta_1, \sigma^2) \right)\)
    • Equivalent to \(\min(RSS)\)

      \[RSS = \sum_{i=1}^n \hat{\epsilon}_i^2 = \sum_{i=1}^n \left( Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i) \right)^2\]
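
To see the equivalence numerically, here is a hedged sketch that maximizes the conditional log-likelihood with optim() and compares the result to lm(); the data are simulated and all names are illustrative:

set.seed(123)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100)

# negative conditional log-likelihood; par = (beta0, beta1, log(sigma))
negll <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}

mle <- optim(c(0, 0, 0), negll)
mle$par[1:2]     # MLE of (beta0, beta1)
coef(lm(y ~ x))  # least-squares estimates agree up to optimizer tolerance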

Properties of the least squares estimator

  • Conditional on \(X^n = (X_1, \ldots, X_n)\)
  • \(\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)^T\)

    \[ \begin{align} \E (\hat{\beta} | X^n) &= \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} \\ \Var (\hat{\beta} | X^n) &= \frac{\sigma^2}{n s_X^2} \begin{pmatrix} \frac{1}{n} \sum_{i=1}^n X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1 \end{pmatrix} \end{align} \]

    \[s_X^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2\]

Properties of the least squares estimator

\[ \begin{align} \widehat{\se} (\hat{\beta}_0) &= \frac{\hat{\sigma}}{s_X \sqrt{n}} \sqrt{\frac{ \sum_{i=1}^n X_i^2}{n}} \\ \widehat{\se} (\hat{\beta}_1) &= \frac{\hat{\sigma}}{s_X \sqrt{n}} \end{align} \]
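
A quick numerical check of these formulas against summary(lm()), on simulated data (all names illustrative):

set.seed(42)
n <- 100
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)
mod <- lm(y ~ x)

sigma_hat <- sqrt(sum(residuals(mod)^2) / (n - 2))
s_x <- sqrt(mean((x - mean(x))^2))    # note the 1/n divisor, matching s_X^2 above
se_b1 <- sigma_hat / (s_x * sqrt(n))
se_b0 <- se_b1 * sqrt(mean(x^2))

c(se_b0, se_b1)  # should match:
summary(mod)$coefficients[, "Std. Error"]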

Properties of the least squares estimator

  1. Consistency

    \[\hat{\beta}_0 \xrightarrow{P} \beta_0 \, \text{and} \, \hat{\beta}_1 \xrightarrow{P} \beta_1\]

  2. Asymptotic normality

    \[\frac{\hat{\beta}_0 - \beta_0}{\widehat{\se}(\hat{\beta}_0)} \leadsto N(0,1) \, \text{and} \, \frac{\hat{\beta}_1 - \beta_1}{\widehat{\se}(\hat{\beta}_1)} \leadsto N(0,1)\]

  3. Approximate \(1 - \alpha\) confidence intervals for \(\beta_0\) and \(\beta_1\) are

    \[\hat{\beta}_0 \pm z_{\alpha / 2} \widehat{\se}(\hat{\beta}_0) \, \text{and} \, \hat{\beta}_1 \pm z_{\alpha / 2} \widehat{\se}(\hat{\beta}_1)\]

  4. The Wald test for testing \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \neq 0\) is reject \(H_0\) if \(\mid W \mid > z_{\alpha / 2}\) where

    \[W = \frac{\hat{\beta}_1}{\widehat{\se}(\hat{\beta}_1)}\]

Assumptions of linear regression models

  • Linearity
  • Constant variance
  • Normality
  • Independence
  • Fixed \(X\)
  • \(X\) is not invariant

Linearity

\[\E(\epsilon_i) \equiv \E(\epsilon_i | X_i) = 0\]

\[ \begin{aligned} \mu_i \equiv \E(Y_i) \equiv \E(Y | X_i) &= \E(\beta_0 + \beta_1 X_i + \epsilon_i) \\ \mu_i &= \beta_0 + \beta_1 X_i + \E(\epsilon_i) \\ \mu_i &= \beta_0 + \beta_1 X_i + 0 \\ \mu_i &= \beta_0 + \beta_1 X_i \end{aligned} \]

Constant variance

  • The variance of the errors is the same regardless of the values of \(X\)

    \[\Var(\epsilon_i | X_i) = \sigma^2\]

Normality

  • The errors are assumed to be normally distributed

    \[\epsilon_i \mid X_i \sim N(0, \sigma^2)\]

Independence

  • Observations are sampled independently from one another
  • Any pair of errors \(\epsilon_i\) and \(\epsilon_j\) are independent for \(i \neq j\).

Fixed \(X\)

  • \(X\) is assumed to be fixed or measured without error and independent of the error
  • Experimental design
  • Observational study

    \[\epsilon_i \sim N(0, \sigma^2), \text{for } i = 1, \dots, n\]

\(X\) is not invariant

Handling violations of assumptions

  • Ooopsie
  • Inference can be
    • Tricky
    • Biased
    • Inefficient
    • Error prone
  • Move to a more robust inferential method
  • Diagnose assumption violations and address them within the OLS framework

Unusual and influential data

  • Outliers
    • Unusual observations
    • Due to \(Y_i\)
    • Due to \(X_i\)
    • Some combination of the two
  • Potential disproportionate influence on the regression model

Terms

  • Outlier
  • Leverage
  • Discrepancy
  • Influence

    \[\text{Influence} = \text{Leverage} \times \text{Discrepancy}\]


Measuring leverage

  • Hat statistic

    \[h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_{j} - \bar{X})^2}\]

  • Measures the contribution of observation \(Y_i\) to the fitted value \(\hat{Y}_j\)
  • Solely a function of \(X\)
  • Larger values indicate higher leverage
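
In R, hat values come straight from the fitted model object; a minimal sketch, with `mod` a hypothetical lm fit:

h <- hatvalues(mod)   # hat statistic for each observation
h_bar <- mean(h)      # average hat value, (k + 1) / n
which(h > 2 * h_bar)  # flag observations with noteworthy leverage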

Measuring discrepancy

  • Use the residuals?
  • Standardized residual

    \[\hat{\epsilon}_i ' \equiv \frac{\hat{\epsilon}_i}{S_{E} \sqrt{1 - h_i}}\]

    • \(S_E = \sqrt{\frac{\sum_{i=1}^n \hat{\epsilon}_i^2}{n - k - 1}}\)
  • Studentized residual

    \[\hat{\epsilon}_i^{\ast} \equiv \frac{\hat{\epsilon}_i}{S_{E(-i)} \sqrt{1 - h_i}}\]
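
Both versions are built into R; a minimal sketch, again with a hypothetical fitted model `mod`:

rstandard(mod)  # standardized residuals, using the overall estimate S_E
rstudent(mod)   # studentized residuals, using the deleted-case estimate S_E(-i)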

Measuring influence

  • \(\text{DFBETA}_{ij}\)

    \[D_{ij} = \hat{\beta}_{1j} - \hat{\beta}_{1j(-i)}, \text{for } i=1, \dots, n \text{ and } j = 0, \dots, k\]

  • \(\text{DFBETAS}_{ij}\)

    \[D^{\ast}_{ij} = \frac{D_{ij}}{SE_{-i}(\beta_{1j})}\]

  • Cook’s D

    \[D_i = \frac{\hat{\epsilon}^{'2}_i}{k + 1} \times \frac{h_i}{1 - h_i}\]
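
All three influence measures are available from a fitted model in R; a sketch with `mod` hypothetical:

dfbeta(mod)           # DFBETA: change in each coefficient when observation i is deleted
dfbetas(mod)          # DFBETAS: the same change, scaled by the deleted-case standard error
cooks.distance(mod)   # Cook's D

influence.measures(mod) prints all of these alongside the hat values in a single table.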

Visualizing leverage, discrepancy, and influence

  • Number of federal laws struck down by the U.S. Supreme Court in each Congress, modeled as a function of:
  1. Age
  2. Tenure
  3. Unified

OLS model

## # A tibble: 4 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept) -12.1       2.54       -4.76 0.00000657
## 2 age           0.219     0.0448      4.88 0.00000401
## 3 tenure       -0.0669    0.0643     -1.04 0.300     
## 4 unified       0.718     0.458       1.57 0.121

Outliers?


Hat-values

  • Anything exceeding twice the average \(\bar{h} = \frac{k + 1}{n}\) is noteworthy

    ## # A tibble: 9 x 10
    ##   Congress congress nulls   age tenure unified  year    hat student  cooksd
    ##   <chr>       <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>
    ## 1 1st             1     0  49.8  0.800       1  1789 0.0974   0.330 0.00296
    ## 2 3rd             3     0  52.8  4.20        0  1793 0.113    0.511 0.00841
    ## 3 12th           12     0  49    6.60        1  1811 0.0802   0.669 0.00980
    ## 4 17th           17     0  59   16.6         1  1821 0.0887  -0.253 0.00157
    ## 5 20th           20     0  61.7 17.4         1  1827 0.0790  -0.577 0.00719
    ## 6 23rd           23     0  64   18.4         1  1833 0.0819  -0.844 0.0159 
    ## 7 34th           34     0  64   14.6         0  1855 0.0782  -0.561 0.00671
    ## 8 36th           36     0  68.7 17.8         0  1859 0.102   -1.07  0.0326 
    ## 9 99th           99     3  71.9 16.7         0  1985 0.0912   0.295 0.00221

Studentized residuals

  • Anything outside of the range \([-2,2]\) is discrepant

    ## # A tibble: 7 x 10
    ##   Congress congress nulls   age tenure unified  year    hat student cooksd
    ##   <chr>       <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>
    ## 1 67th           67     6  66     9          1  1921 0.0361    2.14 0.0415
    ## 2 74th           74    10  71.1  14.2        1  1935 0.0514    4.42 0.223 
    ## 3 90th           90     6  64.7  13.3        1  1967 0.0195    2.49 0.0292
    ## 4 91st           91     6  65.1  13          1  1969 0.0189    2.42 0.0269
    ## 5 92nd           92     5  62     9.20       1  1971 0.0146    2.05 0.0150
    ## 6 98th           98     7  69.9  14.7        0  1983 0.0730    3.02 0.165 
    ## 7 104th         104     8  60.6  12.5        1  1995 0.0208    4.48 0.0897

Influence

  • \(D_i > \frac{4}{n - k - 1}\)

    ## # A tibble: 4 x 10
    ##   Congress congress nulls   age tenure unified  year    hat student cooksd
    ##   <chr>       <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>
    ## 1 67th           67     6  66      9         1  1921 0.0361    2.14 0.0415
    ## 2 74th           74    10  71.1   14.2       1  1935 0.0514    4.42 0.223 
    ## 3 98th           98     7  69.9   14.7       0  1983 0.0730    3.02 0.165 
    ## 4 104th         104     8  60.6   12.5       1  1995 0.0208    4.48 0.0897

How to treat unusual observations

  • Mistake
  • Weird observation
  • Why is it weird?

Omit unusual observations

## [1] 0.232
## [1] 0.258
## [1] 1.68
## [1] 1.29

Non-normally distributed errors

\[\epsilon_i | X_i \sim N(0, \sigma^2)\]

  • Under the CLT, inference based on the least-squares estimator is approximately valid under broad conditions
  • Violation does not affect the validity of the estimates
  • Violation does affect their efficiency

Detecting non-normally distributed errors

## # A tibble: 3,997 x 4
##      age sex    compositeHourlyWages yearsEducation
##    <int> <chr>                 <dbl>          <int>
##  1    40 Male                  10.6              15
##  2    19 Male                  11                13
##  3    46 Male                  17.8              14
##  4    50 Female                14                16
##  5    31 Male                   8.2              15
##  6    30 Female                17.0              13
##  7    61 Female                 6.7              12
##  8    46 Female                14                14
##  9    43 Male                  19.2              18
## 10    17 Male                   7.25             11
## # ... with 3,987 more rows
## # A tibble: 4 x 5
##   term           estimate std.error statistic   p.value
##   <chr>             <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      -8.12    0.599       -13.6 5.27e- 41
## 2 sexMale           3.47    0.207        16.8 4.04e- 61
## 3 yearsEducation    0.930   0.0343       27.1 5.47e-149
## 4 age               0.261   0.00866      30.2 3.42e-180

## [1]  967 3911


Fixing non-normally distributed errors

  • Power transformations
  • Log transformations

Fixing non-normally distributed errors

## # A tibble: 4 x 5
##   term           estimate std.error statistic   p.value
##   <chr>             <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      1.10    0.0380        28.9 1.97e-167
## 2 sexMale          0.224   0.0131        17.1 2.16e- 63
## 3 yearsEducation   0.0559  0.00217       25.7 2.95e-135
## 4 age              0.0182  0.000549      33.1 4.50e-212

## [1] 2760 3267

Non-constant error variance

  • Homoscedasticity

    \[\text{Var}(\epsilon_i) = \sigma^2\]

    \[\widehat{\se}(\hat{\beta}_{1j}) = \sqrt{\hat{\sigma}^{2} (X'X)^{-1}_{jj}}\]

  • Heteroscedasticity

Detecting heteroscedasticity

\[Y_i = 2 + 3X_i + \epsilon_i\]

\[\epsilon_i \sim N(0,1)\]

Detecting heteroscedasticity

\[Y_i = 2 + 3X_i + \epsilon_i\]

\[\epsilon_i \sim N\left(0,\frac{X_i}{2}\right)\]

Breusch-Pagan test

  • Estimate an OLS model and obtain the squared residuals \(\hat{\epsilon}^2\)
  • Regress \(\hat{\epsilon}^2\) against:
    • All the \(k\) variables you think might be causing the heteroscedasticity
  • Calculate \(R^2_{\hat{\epsilon}^2}\) for this auxiliary model and multiply it by the number of observations \(n\)
    • Under the null of homoscedasticity, \(n R^2_{\hat{\epsilon}^2}\) follows a \(\chi^2_{k}\) distribution (see the sketch below)
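
A hedged sketch of the test by hand, alongside the packaged version in lmtest::bptest() (which produced the output on the next slide); `mod`, `df`, and the predictor `x` are hypothetical:

library(lmtest)

aux <- lm(residuals(mod)^2 ~ x, data = df)  # auxiliary regression with k = 1 variable
bp  <- nobs(aux) * summary(aux)$r.squared   # n * R^2
pchisq(bp, df = 1, lower.tail = FALSE)      # compare against chi-squared with k df

bptest(mod)  # the studentized variant of the same idea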

Breusch-Pagan test

## 
##  studentized Breusch-Pagan test
## 
## data:  sim_homo_mod
## BP = 0.3, df = 1, p-value = 0.6
## 
##  studentized Breusch-Pagan test
## 
## data:  sim_hetero_mod
## BP = 200, df = 1, p-value <2e-16

Weighted least squares regression

  • \(\epsilon_i \sim N(0, \sigma_i^2)\)

    \[ \begin{bmatrix} \sigma_1^2 & 0 & 0 & 0 \\ 0 & \sigma_2^2 & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & \sigma_n^2 \\ \end{bmatrix} \]

We can define the reciprocal of each variance \(\sigma_i^2\) as the weight \(w_i = \frac{1}{\sigma_i^2}\), then let matrix \(\mathbf{W}\) be a diagonal matrix containing these weights:

\[ \mathbf{W} = \begin{bmatrix} \frac{1}{\sigma_1^2} & 0 & 0 & 0 \\ 0 & \frac{1}{\sigma_2^2} & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & \frac{1}{\sigma_n^2} \\ \end{bmatrix} \]

So rather than following the traditional linear regression estimator

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}\]

we can substitute in the weighting matrix \(\mathbf{W}\):

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}' \mathbf{W} \mathbf{X})^{-1} \mathbf{X}' \mathbf{W} \mathbf{y}\]

The error variance is then estimated from the weighted residuals:

\[\hat{\sigma}^2 = \frac{\sum_{i=1}^n w_i \hat{\epsilon}_i^2}{n}\]

This is equivalent to minimizing the weighted sum of squares, giving greater weight to observations with smaller variance.
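
A minimal sketch of both routes in R, assuming the weights w are already known (X, y, and w all hypothetical):

# built-in: lm() accepts weights directly
wls <- lm(y ~ x, weights = w)

# by hand, following the matrix formula above
X <- cbind(1, x)
W <- diag(w)
beta_wls <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% y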

How do we estimate the weights \(w_i\)?

  1. Use the residuals from a preliminary OLS regression to obtain estimates of the error variance within different subsets of observations.
  2. Model the weights as a function of observable variables in the model.

For example, using the first approach on our original SLID model:
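
The original code is not shown; one plausible implementation of this approach, with hypothetical object names (so the output below may not be reproduced exactly):

# the absolute residuals from the preliminary OLS fit track the error spread
aux <- lm(abs(residuals(ols_mod)) ~ fitted(ols_mod))

# weights are the reciprocals of the estimated variances
w <- 1 / fitted(aux)^2
wls_mod <- update(ols_mod, weights = w)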

## # A tibble: 4 x 5
##   term           estimate std.error statistic   p.value
##   <chr>             <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      -8.12    0.599       -13.6 5.27e- 41
## 2 sexMale           3.47    0.207        16.8 4.04e- 61
## 3 yearsEducation    0.930   0.0343       27.1 5.47e-149
## 4 age               0.261   0.00866      30.2 3.42e-180
## # A tibble: 4 x 5
##   term           estimate std.error statistic p.value
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)      -8.10   0.0160       -506.       0
## 2 sexMale           3.47   0.00977       356.       0
## 3 yearsEducation    0.927  0.00142       653.       0
## 4 age               0.261  0.000170     1534.       0

We see some mild changes in the estimated parameters, but drastic reductions in the standard errors. The problem is that this reduction is potentially biased: because the weights were estimated from the original residuals, they carry an additional source of sampling uncertainty that the estimated covariance matrix does not account for. Instead, it would be better to model the weights as a function of relevant explanatory variables.1

Corrections for the variance-covariance estimates

Alternatively, we could attempt to correct for heteroscedasticity only in the standard error estimates. This produces the same estimated parameters, but adjusts the standard errors to account for the violation of the constant error variance assumption (so we won’t falsely believe our estimates are more precise than they really are). One major estimation procedure is Huber-White standard errors (also called robust standard errors), which can be recovered using the car::hccm() function:2
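
As a sketch of that computation (with `mod` standing in for the SLID model object):

library(car)

rob_vcov <- hccm(mod)            # Huber-White ("sandwich") variance-covariance matrix
rob_se   <- sqrt(diag(rob_vcov)) # robust standard errors from its diagonal

Appending these robust standard errors to the tidy output gives: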

## # A tibble: 4 x 6
##   term           estimate std.error statistic   p.value std.error.rob
##   <chr>             <dbl>     <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)      -8.12    0.599       -13.6 5.27e- 41       0.636  
## 2 sexMale           3.47    0.207        16.8 4.04e- 61       0.207  
## 3 yearsEducation    0.930   0.0343       27.1 5.47e-149       0.0385 
## 4 age               0.261   0.00866      30.2 3.42e-180       0.00881

Notice that these new standard errors are a bit larger than the original model, accounting for the increased uncertainty of our parameter estimates due to heteroscedasticity.

Non-linearity in the data

Partial residual plots

Define the partial residual for the \(j\)th explanatory variable:

\[\hat{\epsilon}_i^{(j)} = \hat{\epsilon}_i + \hat{\beta}_j X_{ij}\]

In essence, calculate the least-squares residual (\(\hat{\epsilon}_i\)) and add to it the linear component of the partial relationship between \(Y\) and \(X_j\). Finally, we can plot \(X_j\) versus \(\hat{\epsilon}_i^{(j)}\) and assess the relationship. For instance, consider the results of the logged wage model from earlier:
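
R can return the partial residuals directly, and car::crPlots() draws the component-plus-residual plots in a single call; a minimal sketch with `mod` hypothetical:

library(car)

pres <- residuals(mod, type = "partial")  # one column of partial residuals per term
crPlots(mod)                              # component + residual plot for each predictor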

The solid lines are GAMs, while the dashed lines are linear least-squares fits. For age, the partial relationship with logged wages is not linear: some transformation of age is necessary to correct this. For education, the relationship is approximately linear except for the discrepancy among individuals with very low education levels.

We can correct this by adding a squared term for age and squaring the education term. The resulting regression model is:

\[\log(\text{Wage}) = \beta_0 + \beta_1(\text{Male}) + \beta_2 \text{Age} + \beta_3 \text{Age}^2 + \beta_4 \text{Education}^2\]

## # A tibble: 5 x 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)          0.397    0.0578         6.87 7.62e- 12
## 2 sexMale              0.221    0.0124        17.8  3.21e- 68
## 3 I(yearsEducation^2)  0.00181  0.0000786     23.0  1.19e-109
## 4 age                  0.0830   0.00319       26.0  2.93e-138
## 5 I(age^2)            -0.000852 0.0000410    -20.8  3.85e- 91

Because the model is now nonlinear in both age and education, we need to rethink how to draw the partial residuals plot. The easiest approach is to plot the partial residuals for both age and education against the original explanatory variable. For age, that is

\[\hat{\epsilon}_i^{\text{Age}} = 0.083 \times \text{Age}_i -0.0008524 \times \text{Age}^2_i + \hat{\epsilon}_i\]

and for education,

\[\hat{\epsilon}_i^{\text{Education}} = 0.002 \times \text{Education}^2_i + \hat{\epsilon}_i\]

On the same graph, we also plot the partial fits for the two explanatory variables:

\[\hat{Y}_i^{(\text{Age})} = 0.083 \times \text{Age}_i -0.0008524 \times \text{Age}^2_i\]

and for education,

\[\hat{Y}_i^{(\text{Education})} = 0.002 \times \text{Education}^2_i\]

On the graphs, the solid lines represent the partial fits and the dashed lines represent the partial residuals. If the two lines overlap significantly, then the revised model does a good job accounting for the nonlinearity.

Collinearity

Perfect collinearity

Perfect collinearity is incredibly rare, and typically involves using transformed versions of a variable in the model along with the original variable. For example, let’s estimate a regression model explaining mpg as a function of disp, wt, and cyl:

## 
## Call:
## lm(formula = mpg ~ disp + wt + cyl, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.403 -1.403 -0.495  1.339  6.072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.10768    2.84243   14.46  1.6e-14 ***
## disp         0.00747    0.01184    0.63   0.5332    
## wt          -3.63568    1.04014   -3.50   0.0016 ** 
## cyl         -1.78494    0.60711   -2.94   0.0065 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.59 on 28 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.815 
## F-statistic: 46.4 on 3 and 28 DF,  p-value: 5.4e-11

Now let’s say we want to recode disp so it is centered around its mean and re-estimate the model:
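
The recoding step itself is one line of base R; a sketch of what presumably preceded the output below:

mtcars$disp_mean <- mtcars$disp - mean(mtcars$disp)  # mean-centered copy of disp
summary(lm(mpg ~ disp + wt + cyl + disp_mean, data = mtcars))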

## 
## Call:
## lm(formula = mpg ~ disp + wt + cyl + disp_mean, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.403 -1.403 -0.495  1.339  6.072 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.10768    2.84243   14.46  1.6e-14 ***
## disp         0.00747    0.01184    0.63   0.5332    
## wt          -3.63568    1.04014   -3.50   0.0016 ** 
## cyl         -1.78494    0.60711   -2.94   0.0065 ** 
## disp_mean         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.59 on 28 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.815 
## F-statistic: 46.4 on 3 and 28 DF,  p-value: 5.4e-11

Oops. What’s the problem? disp and disp_mean are perfectly correlated with each other:

Because they perfectly explain each other, we cannot estimate a linear regression model that contains both variables.3 Fortunately R automatically drops the second variable so it can estimate the model. Because of this, perfect multicollinearity is rarely problematic in social science.

Less-than-perfect collinearity

Instead consider the credit dataset:

Age and limit are not strongly correlated with one another, so estimating a linear regression model to predict an individual’s balance as a function of age and limit is not a problem:

## # A tibble: 3 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -173.     43.8         -3.96 9.01e-  5
## 2 age           -2.29    0.672       -3.41 7.23e-  4
## 3 limit          0.173   0.00503     34.5  1.63e-121

But what about using an individual’s credit card rating instead of age? It is likely a good predictor of balance as well:

## # A tibble: 3 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) -378.       45.3       -8.34  1.21e-15
## 2 limit          0.0245    0.0638     0.384 7.01e- 1
## 3 rating         2.20      0.952      2.31  2.13e- 2

By replacing age with rating, we have introduced a problem into our model. The problem is that limit and rating are strongly correlated with one another:

In the regression model, it is difficult to parse out the independent effects of limit and rating on balance, because limit and rating tend to increase and decrease in association with one another. Because the accuracy of our estimates of the parameters is reduced, the standard errors increase. This is why you can see above that the standard error for limit is much larger in the second model compared to the first model.

Detecting collinearity

Scatterplot matrix

A correlation or scatterplot matrix would help to reveal any strongly correlated variables:

Here it is very clear that limit and rating are strongly correlated with one another.

Variance inflation factor (VIF)

Unfortunately correlation matrices may not be sufficient to detect collinearity if the correlation exists between three or more variables (aka multicollinearity) while not existing between any two pairs of these variables. Instead, we can calculate the variance inflation factor (VIF), which is the ratio of the variance of \(\hat{\beta}_{1j}\) when fitting the full model to the variance of \(\hat{\beta}_{1j}\) when that predictor is fit in its own separate model. We can use the car::vif() function in R to calculate this statistic for each coefficient. A good rule of thumb is that a VIF statistic greater than 10 indicates potential multicollinearity in the model. Applied to the credit regression models above:
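
A sketch of the calls, assuming the two models are stored as credit_age_mod and credit_rating_mod (hypothetical names):

library(car)

vif(credit_age_mod)     # balance ~ age + limit
vif(credit_rating_mod)  # balance ~ limit + rating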

##   age limit 
##  1.01  1.01
##  limit rating 
##    160    160

Fixing multicollinearity

What not to do

Drop one or more of the collinear variables from the model

This is not a good idea, even if it makes your results “significant”. By omitting the variable, you are completely re-specifying your model in direct contradiction to your theory. If your theory suggests that a variable can be dropped, go ahead. But if not, then don’t do it.

What you could do instead

Add data

The more observations, the better. It could at least decrease your standard errors and give you more precise estimates. And if you add “odd” or unusual observations, it could also reduce the degree of multicollinearity.

Transform the covariates

If the variables are indicators of the same underlying concept, you can combine them into an index variable. This could be an additive index where you sum up comparable covariates or binary indicators. Alternatively, you could create an index via principal components analysis.
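
A hedged sketch of both options, with x1, x2, and x3 as hypothetical indicators of the same underlying concept:

# additive index: standardize each indicator, then sum
idx_additive <- scale(x1) + scale(x2) + scale(x3)

# PCA index: the first principal component of the indicators
pca <- prcomp(cbind(x1, x2, x3), scale. = TRUE)
idx_pca <- pca$x[, 1]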

Shrinkage methods

Shrinkage methods involve fitting a model involving all \(p\) predictors and shrinking the estimated coefficients towards zero. This shrinkage reduces variance in the model. When multicollinearity is high, the variance of the estimator \(\hat{\beta}_1\) is also high. By shrinking the estimated coefficient towards zero, we may increase bias in exchange for smaller variance in our estimates.
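
Ridge regression is the canonical example; a sketch using the glmnet package, where alpha = 0 selects the ridge penalty (X a predictor matrix and y a response vector, both hypothetical):

library(glmnet)

ridge <- cv.glmnet(X, y, alpha = 0)  # cross-validate the penalty lambda
coef(ridge, s = "lambda.min")        # coefficients shrunk toward zero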


  1. Omitted here for time purposes. You can find details on this estimation procedure on the internet.

  2. This function really returns the “sandwich” estimator of the variance-covariance matrix, so we need to further take the square root of the diagonal of this matrix.

  3. Basically we cannot invert the variance-covariance matrix of \(\mathbf{X}\) because the collinear columns in \(\mathbf{X}\) are perfectly linearly dependent on each other. Because of this, we cannot get parameter estimates or standard errors for the model.